Abstract

Background: Molecular interactions need to be taken into account to adequately model the complex behavior of 
biological systems. These interactions are captured by various types of biological networks, such as metabolic, 
gene-regulatory, signal transduction and protein-protein interaction networks. We recently developed Natalie, which 
computes high-quality network alignments via advanced methods from combinatorial optimization. 

Results: Here, we present NatalieQ, a web server for topology-based alignment of a specified query protein-protein 
interaction network to a selected target network using the Natalie algorithm. By incorporating similarity at both the 
sequence and the network level, we compute alignments that allow for the transfer of functional annotation as well 
as for the prediction of missing interactions. We illustrate the capabilities of NatalieQ with a biological case study 
involving the Wnt signaling pathway. 

Conclusions: We show that topology-based network alignment can produce results complementary to those 
obtained by using sequence similarity alone. We also demonstrate that NatalieQ is able to predict putative 
interactions. The server is available at: http://www.ibi.vu.nl/programs/natalieq/. 
Background

To adequately model complex behavior of biological 
systems one needs to take molecular interactions into 
account. These interactions are captured by various types 
of biological networks such as metabolic, gene-regulatory, 
signal transduction and protein-protein interaction (PPI) 
networks. Recent advances in technological develop- 
ments and computational methods have resulted in large 
amounts of network data. For instance, STRING [1], 
a database of experimentally verified and computation- 
ally predicted protein interactions, grew from 261,033 
proteins in 89 organisms in 2003 to 5,214,234 pro- 
teins in 1,133 organisms in January 2014. However, the 
development of solid methods for analyzing network 
data is lagging behind, particularly in the field of com- 
parative network analysis. Here, one wants to detect 
commonalities between biological networks from different
strains or species, or derived from different conditions.
In contrast to traditional comparison at sequence
level, topology-based comparison methods explicitly take
interactions into account and are thus more suitable to
compare networks. Subnetworks with shared interactions
across species allow for improved transfer of functional
annotations from one species to the other by using more
information than sequence alone [2].

We have developed NatalieQ, a web server for accu- 
rate topology-based protein-protein interaction network 
queries. It provides an interface to the general network 
alignment method NATALIE [3,4], which is fast and sup- 
ports various scoring schemes taking both node-to-node 
correspondences and network topologies into account. 
Briefly, NATALIE views the network alignment problem as 
a generalization of the well-studied quadratic assignment 
problem and solves it using techniques from integer linear 
programming. 

Currently, only a few web servers for comparative network
analysis exist. The PathBLAST web server [5]
reports exact and approximate hits in a target PPI network
for a user-defined simple query, expressed as a linear
path of up to five proteins. The NetworkBLAST
web server [6] finds locally-conserved protein complexes 
between species-specific PPI networks. NetAligner [7], 
a recent web server, allows the comparison of user- 
defined networks or whole interactomes within a set 
of fixed species using a heuristic network alignment 
with no guarantees on the optimality of the identified 
solutions. 

Our contribution is twofold. First, NatalieQ employs 
a new scoring function to produce high-quality pair- 
wise alignments between a user-specified query network 
of arbitrary topology and interactomes of several model 
species and human. The score of an alignment is primar- 
ily based on the number of conserved interactions, while 
sequence similarity is used as a secondary, subordinate 
optimization goal. In addition, the alignments computed 
by the underlying NATALIE algorithm come with a qual- 
ity guarantee that often proves their optimality. Second, 
through an interactive visualization of the alignment, the 
user can quickly get an overview of conserved and non- 
conserved interactions and can use the protein descrip- 
tions of the nodes to assess the alignment. We illustrate 
a usage scenario of the web server on the Wnt signaling 
pathway and demonstrate that NatalieQ is able to pre- 
dict putative interactions that are not detected by other 
methods. 

Implementation 

Network alignment algorithm 

Natalie, the alignment method of NatalieQ, is applica- 
ble to any type of network and supports any additive score 
function taking both node-to-node correspondences and 
topology into account. Here, we take as input a pair of 
PPI networks whose nodes and edges correspond to proteins
and their interactions. Let $G_1 = (V_1, E_1)$ and
$G_2 = (V_2, E_2)$ be two PPI networks whose edges have a
confidence value above a user-defined threshold $C_{\min}$. We
denote by $E(v_1, v_2)$ the E-value of proteins $v_1 \in V_1$ and
$v_2 \in V_2$ obtained by an all-against-all sequence alignment.
Typically, $G_1$ is a smaller query network such as a specific
pathway of interest, and $G_2$ is a large species-specific PPI
network.

A network alignment is a partial injective function $a : V_1 \to V_2$
with the additional requirement that if $v_1 \in V_1$
is aligned then $a(v_1) \in \{v_2 \in V_2 \mid E(v_1, v_2) \le E_{\max}\}$.
That is, every node $v_1 \in V_1$ is related to at most one node
$v_2 \in V_2$ with E-value $E(v_1, v_2)$ not exceeding a pre-specified
cut-off $E_{\max}$ and vice versa. We score the topology component
of an alignment $a$ as follows

$$t(a) = \frac{1}{\min\{|E_1|, |E_2|\}} \sum_{uv \in E_1} w(u, v) \tag{1}$$

with

$$w(u, v) = \begin{cases} 1 & \text{if } (a(u), a(v)) \in E_2, \\ 0 & \text{otherwise.} \end{cases}$$

This score is also known as edge correctness and denotes
the fraction of edges from the smaller query network that
have been aligned. The problem of global pairwise network
alignment is to find the highest-scoring alignment.
Should there be several alignments with the same maximum
edge correctness, we would prefer the alignment
with the highest overall bit score as obtained by an
all-against-all sequence alignment. We achieve this in the
following way. Let $b(v_1, v_2) \in [0, 1]$ be the normalized bit
score of aligning protein $v_1 \in V_1$ with protein $v_2 \in V_2$.
The total score of an alignment $a$ is then

$$s(a) = t(a) + \frac{1}{1 + \min\{|E_1|, |E_2|\} \cdot \min\{|V_1|, |V_2|\}} \sum_{u \in V_1} b(u, a(u)).$$

That is, the sequence component is ensured to be strictly
smaller than the score contribution of one conserved edge.
Therefore ties among alignments with the same edge correctness
are broken in favor of those with the highest
overall bit score.
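The two-level score can be illustrated with a minimal Python sketch. This is not the Natalie implementation: the function names, the dictionary representation of the alignment `a` and of the normalized bit scores `b`, and the toy networks are assumptions made for illustration. Unaligned nodes are assumed to contribute nothing to either sum, and edges are treated as undirected.

```python
from fractions import Fraction


def topology_score(E1, E2, a):
    """Edge correctness t(a): conserved query edges over min(|E1|, |E2|)."""
    target_edges = {frozenset(e) for e in E2}  # undirected edges
    conserved = sum(
        1
        for u, v in E1
        if u in a and v in a and frozenset((a[u], a[v])) in target_edges
    )
    return Fraction(conserved, min(len(E1), len(E2)))


def total_score(V1, V2, E1, E2, a, b):
    """s(a) = t(a) + sum_u b(u, a(u)) / (1 + min|E| * min|V|)."""
    # Assumption: unaligned query nodes simply add nothing to the sum.
    seq = sum(Fraction(b[(u, a[u])]) for u in V1 if u in a)
    denom = 1 + min(len(E1), len(E2)) * min(len(V1), len(V2))
    return topology_score(E1, E2, a) + seq / denom


# Toy (hypothetical) networks: a 3-node query path and a 4-node target path.
V1, E1 = ["A", "B", "C"], [("A", "B"), ("B", "C")]
V2, E2 = ["x", "y", "z", "w"], [("x", "y"), ("y", "z"), ("z", "w")]
half, one = Fraction(1, 2), Fraction(1)
b = {("A", "x"): half, ("B", "y"): half, ("C", "z"): half, ("C", "w"): one}

a1 = {"A": "x", "B": "y", "C": "z"}  # conserves both query edges
a2 = {"A": "x", "B": "y", "C": "w"}  # better bit scores, one edge lost

print(total_score(V1, V2, E1, E2, a1, b))  # 17/14
print(total_score(V1, V2, E1, E2, a2, b))  # 11/14
```

In this toy case the alignment `a2` has the larger bit-score sum, but `a1` still wins because it conserves one more edge, which is the behavior the scaling factor is meant to guarantee.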

We use Natalie to compute alignments with maxi- 
mum total score. A specific feature of Natalie is that 
any identified solution comes with an upper bound on the 
optimal score value. In the NatalieQ setting with small 
query networks, the upper bound equals the score of the 
alignment found, thereby proving its optimality. The iden- 
tified alignment is not necessarily optimal if there is a 
gap between the score and the upper bound. In that case 
the relative size of the gap provides a bound on the error 
due to suboptimality. In a recent study [4] on aligning PPI 
networks of six different species, NATALIE was compared 
to state-of-the-art network alignment methods, evaluat- 
ing the number of conserved edges as well as functional 
coherence of the modules in terms of Gene Ontology 
annotation. The study established NATALIE as a top net- 
work alignment method with respect to both alignment 
quality and running time. 
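As a small illustration of how a score and its upper bound translate into an error bound, the sketch below computes a relative gap. The exact gap definition used by Natalie is not given here, so measuring the gap relative to the upper bound is an assumption, as is the function name.

```python
def relative_gap(score, upper_bound):
    """Relative optimality gap, measured against the upper bound.

    Assumed definition: (upper_bound - score) / upper_bound. A gap of 0
    means the alignment found is provably optimal.
    """
    if upper_bound <= 0:
        raise ValueError("upper bound must be positive")
    if score > upper_bound:
        raise ValueError("score cannot exceed its upper bound")
    return (upper_bound - score) / upper_bound


print(relative_gap(1.0, 1.0))   # 0.0
print(relative_gap(0.75, 1.0))  # 0.25
```

Under this definition, a gap of g guarantees that the score found is at least (1 - g) times the optimal score, since the optimum cannot exceed the upper bound.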

Databases 

We currently provide eight model species from STRING 
[1] and IntAct [8] as target databases. We added tex- 
tual descriptions to the protein IDs. For the STRING 
networks, these descriptions are available as a separate 
publicly available download. We retrieved the protein 
descriptions for the IntAct networks by cross-referencing 
the IntAct UniProt identifiers with the Swiss-Prot and 
TrEMBL databases [9]. To allow NatalieQ to take protein
sequence information into account, we stored the
amino acid sequences of the proteins in separate FASTA 
files per network. We retrieved these sequences from the 
STRING and IntAct databases. The target databases will 
be updated upon new releases of STRING and IntAct. 

Processing 

NatalieQ computes a network alignment in a two-step 
fashion implemented in a Perl wrapper script. First, the 
wrapper invokes BLAST [10,11] to create pairwise protein 
alignments between the sequences corresponding to the 
nodes of the query and target network. Next, the wrapper
invokes Natalie [3,4] for different E-value cut-offs
$E_{\max} \in \{0, 10^{-100}, 10^{-50}, 10^{-10}, 1, 10, 100\}$. Each cut-off
$E_{\max}$ imposes restrictions on the allowed pairings, that
is, only pairs $(u, a(u))$ with $u \in V_1$ whose E-value is at
most $E_{\max}$ are allowed. During these computations, which
take a few minutes for a typical network query, the user is 
updated about the progress and may bookmark the unique 
web page for this run or leave an e-mail address to be 
notified upon completion. 
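The effect of the cut-off on the allowed pairings can be sketched in Python as follows. This is only an illustration of the filtering step, not the Perl wrapper itself: the function names, the dictionary of E-values and the cut-off values in the example are hypothetical, and no BLAST or Natalie invocation is shown.

```python
def allowed_pairs(evalues, e_max):
    """Candidate (query, target) pairs whose E-value is at most e_max."""
    return {pair for pair, e in evalues.items() if e <= e_max}


def sweep_cutoffs(evalues, cutoffs):
    """Allowed pairings per cut-off; each set would feed one alignment run."""
    return {e_max: allowed_pairs(evalues, e_max) for e_max in cutoffs}


# Hypothetical E-values for three candidate pairs (placeholder node names).
evalues = {("q1", "t1"): 0.0, ("q1", "t2"): 1e-37, ("q2", "t3"): 0.5}

for e_max, pairs in sweep_cutoffs(evalues, [0.0, 1e-10, 1.0]).items():
    print(e_max, sorted(pairs))
```

Because the sets of allowed pairs are nested, a larger cut-off can only enlarge the search space, which is why looser cut-offs can conserve more interactions at the price of weaker sequence similarity.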

Results and discussion 

Web server 

The input of NatalieQ consists of a query network 
that can be in several formats: a simple edge list format,
Cytoscape's SIF format, IntAct's MITAB format or
STRING's text-based format. The input file format is automatically
detected. Optionally, the user can provide a
FASTA file containing the protein sequences correspond- 
ing to the network nodes. In case no FASTA file is supplied 
and the node labels correspond to UniProt, RefSeq or 
GI identifiers, the corresponding sequences are retrieved 
automatically from the NCBI Protein database [12]. The 
user can select one of two well-known protein interac- 
tion databases (IntAct or STRING) and one of currently 
eight model species as target network. Options are the
score function and the confidence threshold $C_{\min}$. We support
two score functions: topology, which is the scoring
function as defined previously, as the default option, and 
sequence only, which results in the best network alignment 
in terms of sequence similarity, disregarding topological 
information. 
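To make the two simplest input formats concrete, the following Python sketch reads either a whitespace-separated edge list (`A B`) or Cytoscape SIF lines (`A <type> B [C ...]`). How NatalieQ actually detects formats is not described here; the token-count heuristic, the function name and the handling of whitespace are assumptions, and MITAB and STRING files are not covered.

```python
def parse_edges(text):
    """Return a set of undirected edges from an edge list or SIF text.

    Hypothetical heuristic: two tokens on a line are read as an edge-list
    entry, three or more as a SIF line whose second token is the
    interaction type. Node names containing spaces are not supported.
    """
    edges = set()
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2:        # simple edge list: "A B"
            edges.add(frozenset(tokens))
        elif len(tokens) >= 3:      # SIF: "A <type> B [C ...]"
            source, targets = tokens[0], tokens[2:]
            edges.update(frozenset((source, t)) for t in targets)
        # zero or one token: blank line or isolated SIF node, no edge
    return edges


print(len(parse_edges("A B\nB C\n")))      # 2
print(len(parse_edges("A pp B C\nD\n")))   # 2
```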

The output page first gives an overview of the results
for the different E-value cut-offs (Figure 1). The user can
select a result for detailed inspection. Interesting results
to inspect are, for example, the one with best sequence
similarity among the top-scoring topological similarities
or the one with best topological score at lowest E-value
cut-off. The detailed view starts with summary statistics
about the input networks and the computational process
(Figure 2). It then displays an interactive network
alignment visualization using the JavaScript D3 library
(http://mbostock.github.com/d3/), which is a data-driven
framework for information visualization. The visualization
(Figure 3) shows the aligned part of the two networks,
overlaying nodes and links using red color for the query
and grey for the target network. Thus, a matched query-target
node or link pair will be colored in both red and
grey. This interactive network visualization shows the user
which parts of the query and target networks are matched.
Hovering over nodes and links displays tool-tips with protein
names and descriptions and link confidence, respectively,
and allows for a quick overview of the alignment. If
the user clicks on a node, information about that node is
shown in a separate table, which in addition to the protein
names and descriptions includes the bit score and E-value
of the BLAST pairwise alignment and a hyperlink to the
original database for more information about the target
protein. The interface allows for a more detailed analysis
by toggling the visibility of node labels, background target
nodes and edges, unmatched query nodes and edges, and
unmatched target edges.

Figure 1 NatalieQ computation overview of the alignments of
the Wnt query network against the target PPI network (STRING)
of D. melanogaster using the topology score function.

In addition, the detailed view shows tables contain- 
ing aligned query-target nodes, edges conserved in both 
query and target network, edges in the query network 
that remain unaligned, and unaligned edges in the tar- 
get network whose incident nodes are aligned (Figure 4). 
The interactive visualization can be exported to a static 
SVG file and the user can download the alignment and the 
interaction tables for further off-line analysis. We support 
Cytoscape [13] by providing Cytoscape-compatible files 
containing the entire alignment and query network as well 
as matched parts of the target network. 
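The three edge tables can be derived from an alignment as in the Python sketch below. It is an illustration under assumptions rather than the server's code: the function name and the dictionary representation of the alignment are hypothetical, and edges are treated as undirected.

```python
def alignment_tables(E1, E2, a):
    """Split edges into conserved, query-only and target-only lists.

    conserved:   query edges whose image under `a` is a target edge
    query_only:  query edges that remain unaligned
    target_only: target edges between aligned nodes that no query edge hits
    """
    target_edges = {frozenset(e) for e in E2}
    conserved, query_only = [], []
    for u, v in E1:
        if u in a and v in a and frozenset((a[u], a[v])) in target_edges:
            conserved.append((u, v))
        else:
            query_only.append((u, v))

    matched_targets = set(a.values())
    image = {frozenset((a[u], a[v])) for u, v in conserved}
    target_only = [
        (x, y)
        for x, y in E2
        if x in matched_targets
        and y in matched_targets
        and frozenset((x, y)) not in image
    ]
    return conserved, query_only, target_only


# Toy (hypothetical) example with placeholder node names.
E1 = [("A", "B"), ("B", "C")]
E2 = [("x", "y"), ("x", "z"), ("z", "w")]
a = {"A": "x", "B": "y", "C": "z"}
print(alignment_tables(E1, E2, a))
# ([('A', 'B')], [('B', 'C')], [('x', 'z')])
```

The target edge `("z", "w")` is not reported because `w` is not aligned to any query node, matching the restriction to target edges whose incident nodes are both aligned.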

Case study: Wnt signaling pathway 

To illustrate the capabilities of NatalieQ, we consider a
biological case study involving the Wnt signaling pathway
whose abnormal signaling has been associated with cancer.
This pathway is initiated by binding of secreted Wnt
signaling proteins to the cell surface receptors Frizzled
and LRP. This causes the activation of the signaling protein
Dishevelled, which in turn inhibits the assembly of the
degradation complex GSK-3β/axin/APC/β-catenin. As a
result, the degradation of β-catenin is prevented, causing
it to accumulate in the nucleus. There, β-catenin forms
a complex with LEF-1/TCF, thereby displacing Groucho.
The newly formed complex induces the transcription of
various Wnt target genes, including c-myc, which is a
proto-oncogene encoding a protein involved in cell
growth and proliferation [14].

Input
  Target network name                    data/string/dme-7227.string
  Target network size                    13144 nodes and 1996782 edges
  Query network size                     11 nodes and 17 edges
  Number of matching edges               475 edges
  E-value cut-off                        1
  Confidence threshold                   10%

Statistics
  Elapsed time                           0.208884s
  Edge correctness                       17/17 = 1
  Sequence contribution                  0.011376
  Optimality gap                         0%
  Number of aligned pairs                11
  Conserved interactions                 17
  Non-conserved interactions in query    0
  Non-conserved interactions in target   22

Figure 2 NatalieQ summary statistics for run number 5 ($E_{\max} = 1$).
Alignment of the Wnt query network against the target PPI network
(STRING) of D. melanogaster using the topology score function.

We manually constructed a PPI network of the pathway
by using a subset of the proteins involved, namely WNT1,
A2MR (LRP1), FZD1 (Frizzled-1), DVL1 (Dishevelled),
AXIN1, GSK3B, CTNNB1 (β-catenin), APC, TCF7,
TLE1 (Groucho), and MYC. For each of these proteins,
we obtained their respective sequences from the
STRING database. The edges we used correspond to
the interactions described above. The query network
consists of 11 nodes and 17 edges and is available
as the example network file on the main page of
NatalieQ.

Figure 3 NatalieQ interactive visualization component showing the
alignment of the Wnt query network (red) with the target PPI
network (STRING, grey, matched part shown) of D. melanogaster using
the sequence only score function at E-value cut-off 1. The purely red
edges, for example, (FZD1, A2MR), hint at interactions that have been
missed by the alignment. See also Figure 4, bottom table. The tool-tip
appears when hovering over the nodes.

Figure 4 NatalieQ alignment tables for the alignment of the Wnt query
network against the target PPI network (STRING) of D. melanogaster
using the sequence only score function at E-value cut-off 1. Blue
entries are links to the STRING database.
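For the Wnt query network (11 nodes, 17 edges) and a target network that is at least as large, the tie-breaking property of the score function from the Implementation section can be checked by hand. The following worked bound is a sketch that assumes every normalized bit score is at most 1 and that at most 11 query nodes can be aligned:

```latex
\[
\frac{1}{1 + 17 \cdot 11} \sum_{u \in V_1} b(u, a(u))
\;\le\; \frac{11}{188} \approx 0.0585
\;<\; \frac{1}{17} \approx 0.0588 .
\]
```

The sequence term can therefore never outweigh a single conserved edge of this query. The sequence contribution of 0.011376 reported in Figure 2 lies well within this bound.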

As a first sanity check, we queried against the human 
PPI network from STRING with link confidence threshold
$C_{\min} = 0.1$. For all E-value cut-offs, NatalieQ found
the optimal alignment where indeed all interactions are 
present and all query proteins are aligned with their iden- 
tical counterparts in the human network as we could 
verify from the descriptions and interaction tables in the 
output. 

For our next experiment, we used the PPI network 
of D. melanogaster as target. See also Figures 1-4 for 
an illustration. To study whether topological information 
improves comparative analysis, we compare the results 
of NatalieQ using both the topology and sequence only 
score functions. We see that in the resulting sequence
only alignments for E-value cut-offs larger than $10^{-50}$
one interaction of the query network is not mapped.
This is the interaction between A2MR and FZD1. The
counterpart of FZD1 in the sequence only alignment is
FBpp0075485 with a bit score of 519 (E-value: 5 · 10-^^^).
The web server also provides the BLAST output, which
shows that FZD1 is indeed sequence-wise most similar
to FBpp0075485. NatalieQ with the topology score
function at E-value cut-offs larger than $10^{-50}$ is able to
match all 17 query interactions and pairs FZD1 with
FBpp0077788 with a bit score of only 150 (E-value: 6 · 10-^^).
Although the bit score is lower than the one obtained
in the sequence-only alignment, the interaction A2MR-FZD1
is now present in the target network and has a
normalized confidence of 0.172. So using NatalieQ,
we find that FZD1 may functionally be more related to
FBpp0077788 than its sequence-wise most similar counterpart
FBpp0075485. This hypothesis is corroborated
by UniProtKB/Swiss-Prot annotation indicating that the
protein FBpp0077788 contains a Frizzled domain. Running
the same example using the NetAligner web server
[7] results in only 5 conserved interactions using default
settings.

This example illustrates how NATALIEQ can facilitate 
the transfer of functional annotation across species. For 
instance, we could transfer functional annotation con- 
cerning the Wnt pathway between the human and fly 
networks by using the alignments we obtained. 

Conclusions 

We developed NatalieQ, a web server for global pairwise 
network alignment of a pre-specified query PPI network
to a selected target network. The underlying alignment
method computes alignments with a worst-case bound on 
their quality. For the biological query networks we consid- 
ered, the optimality gap was closed and provably optimal 
alignments with respect to the used score function were 
thus found. The user can quickly get an overview of 
the alignment through the interactive visualization, where 
conserved and non-conserved interactions are easily 
visible. 

Currently, we support eight different target species from 
both STRING and IntAct. NatalieQ is extendible, and 
we will add more target networks in the future. In addi- 
tion, we plan to exploit the general applicability of the 
underlying NATALIE method by facilitating the identi- 
fication of network motifs through more sophisticated 
query networks where nodes are labeled by GO terms and 
edges are labeled by different interaction types, such as 
inhibition and activation. 

Availability and requirements 

• Project name: NatalieQ 

• Project home page: http://www.ibi.vu.nl/programs/natalieq/

• Operating system(s): Platform independent 

• Programming language: PHP and Perl 

• Other requirements: modern web browser (Internet 
Explorer 9 or higher, Firefox, Chrome or Safari) 

• Any restrictions to use by non-academics: no license required